141 research outputs found

    Universal sequence map (USM) of arbitrary discrete sequences

    BACKGROUND: For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but not been fully realized. The basic idea is that any sequence of symbols may define trajectories in the continuous space, conserving all its statistical properties. Ideally, such a representation would allow scale-independent sequence analysis, without the context of a fixed memory length. A simple example would consist of being able to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.
    RESULTS: We have successfully identified such an iterative function for bijective mapping ψ of discrete sequences into objects of continuous state space that enable scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences with an arbitrary length and an arbitrary number of unique units, and generates a representation where map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of sequences with 4 unit types (like DNA) as an order-free Markov Chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: http://bioinformatics.musc.edu/~jonas/usm/.
    CONCLUSIONS: USM is shown to enable a statistical mechanics approach to sequence analysis. The scale-independent representation frees sequence analysis from the need to assume a memory length in the investigation of syntactic rules.
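
    The iterated map behind CGR, on which the abstract says USM builds, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the corner assigned to each nucleotide and the centre starting point are assumptions (published conventions vary), and the generalization to arbitrary alphabets that USM provides is not shown.

```python
# Chaos Game Representation (CGR) sketch for DNA.
# Assumptions: the corner assigned to each symbol and the centre start point.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr(sequence, start=(0.5, 0.5)):
    """Return one (x, y) coordinate in the unit square per symbol."""
    x, y = start
    coords = []
    for symbol in sequence:
        cx, cy = CORNERS[symbol]
        # Move halfway from the current point towards the symbol's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        coords.append((x, y))
    return coords


if __name__ == "__main__":
    for symbol, point in zip("ACGT", cgr("ACGT")):
        print(symbol, point)
```

    Because every step halves the distance to a corner, each coordinate encodes the whole preceding sequence, with the most recent symbols carrying the most weight.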

    Computing distribution of scale independent motifs in biological sequences

    The use of Chaos Game Representation (CGR) or its generalization, Universal Sequence Maps (USM), to describe the distribution of biological sequences has been found objectionable because of the fractal structure of that coordinate system. Consequently, the investigation of the distribution of symbolic motifs at multiple scales is hampered by an inexact association between distance and sequence dissimilarity. A solution to this problem could unleash the use of iterative maps as phase-state representations of sequences in which their statistical properties can be conveniently investigated. In this study a family of kernel density functions is described that accommodates the fractal nature of iterative function representations of symbolic sequences and, consequently, enables the exact investigation of sequence motifs of arbitrary lengths in that scale-independent representation. Furthermore, the proposed kernel density includes both Markovian succession and currently used alignment-free sequence dissimilarity metrics as special solutions. Therefore, the fractal kernel described is in fact a generalization that provides a common framework for a diverse suite of sequence analysis techniques.

    Twiner: correlation-based regularization for identifying common cancer gene signatures

    © The Author(s) 2019. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
    Background: Breast and prostate cancers are typical examples of hormone-dependent cancers, showing remarkable similarities at the level of hormone-related signaling pathways and exhibiting a high tropism to bone. While the identification of genes playing a specific role in each cancer type brings invaluable insights for gene therapy research by targeting disease-specific cell functions not accounted for so far, identifying a gene signature common to breast and prostate cancers could unravel new targets to tackle shared hormone-dependent disease features, like bone relapse. This would potentially allow the development of new targeted therapies directed to genes regulating both cancer types, with a consequent positive impact on cancer management and health economics.
    Results: We address the challenge of extracting gene signatures from transcriptomic data of prostate adenocarcinoma (PRAD) and breast invasive carcinoma (BRCA) samples, particularly estrogen positive (ER+), and androgen positive (AR+) triple-negative breast cancer (TNBC), using sparse logistic regression. The introduction of gene network information based on the distances between BRCA and PRAD correlation matrices is investigated, through the proposed twin networks recovery (twiner) penalty, as a strategy to ensure that gene features similarly correlated in the two diseases are less penalized during the feature selection procedure.
    Conclusions: Our analysis led to the identification of genes that show a similar correlation pattern in BRCA and PRAD transcriptomic data, are selected as key players in the classification of breast and prostate samples into ER+ BRCA/AR+ TNBC/PRAD tumor and normal tissues, and are also associated with survival time distributions. The results obtained are supported by the literature and are expected to unveil the similarities between the diseases, disclose common disease biomarkers, and help in the definition of new strategies for more effective therapies.
    This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references UID/EEA/50008/2019 (Instituto de Telecomunicações), UID/CEC/50021/2019 (INESC-ID), UID/EMS/50022/2019 (IDMEC, LAETA), PREDICT (PTDC/CCI-CIF/29877/2017), and PERSEIDS (PTDC/EMS-SIS/0642/2014).

    Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data

    Correct classification of breast cancer sub-types is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer (TNBC), which has the worst prognosis among breast cancer types. Using cutting-edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma (BRCA) transcriptomic data publicly available from The Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail in the presence of these outliers, prompting the need for robust statistics. Using robust sparse logistic regression we obtain 36 relevant genes, of which ca. 60% have been previously reported as biologically relevant to TNBC, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for TNBC. Out of these, JAM3, SFT2D2 and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between TNBC and non-TNBC data. The individual role of FOXA1 in TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC stand out. Not only will our results contribute to the understanding of breast cancer/TNBC and ultimately its management, they also show that robust regression and outlier detection constitute key strategies to cope with high-dimensional clinical data such as omics data.

    Optimization strategies for metabolic networks

    Background: The increasing availability of models and data for metabolic networks poses new challenges for the optimization of biological systems. Due to the high level of complexity and uncertainty associated with these networks, the suggested models often lack the detail and reliability required to determine the proper optimization strategies. A possible approach to overcome this limitation is the combination of both kinetic and stoichiometric models. In this paper three control optimization methods, with different levels of complexity and assuming various degrees of process information, are presented and their results compared using a prototype network.
    Results: The results obtained show that Bi-Level optimization leads to a good approximation of the optimum attainable with the full information on the original network. Furthermore, using Pontryagin's Maximum Principle it is shown that the optimal control for the network in question can only assume values on the extremes of the interval of its possible values.
    Conclusions: It is shown that, for a class of networks in which the product that favors cell growth competes with the desired product yield, the optimal control that explores this trade-off assumes only extreme values. The proposed Bi-Level optimization led to a good approximation of the original network, making it possible to overcome the limitation on the available information that is often present in metabolic network models. Although only a prototype network is considered, it is stressed that the results obtained concern methods, and provide guidelines that are valid in a wider context.

    Fractal MapReduce decomposition of sequence alignment

    This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from NIH National Center for Research Resources, by the National Cancer Institute grant 1U24CA143883-01, and by the European Union FP7 PNEUMOPATH (HEALTH F3 2009 222983).
    Background: The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.
    Results: In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique that has been found to provide an "alignment-free" solution to sequence analysis and comparison, that is, a solution that does not require dynamic programming and relies instead on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units, without resorting to dynamic programming.
    Conclusions: The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.
    Availability: Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm".
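
    The idea of reading a similar-segment length off two coordinates alone can be illustrated with plain CGR. The sketch below is an assumption-laden simplification, not the published procedure: it uses only a forward map with a fixed centre start, an assumed corner assignment, and the estimate -log2 of the largest coordinate difference. Each shared trailing symbol halves that difference, so the estimate is never smaller than the true number of shared trailing symbols, but it can overestimate it.

```python
import math

# Assumed corner assignment; published conventions vary.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_last(sequence, start=(0.5, 0.5)):
    """Return the CGR coordinate of the last symbol of `sequence`."""
    x, y = start
    for symbol in sequence:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def shared_suffix_estimate(seq_a, seq_b):
    """Estimate the shared trailing length from the two end coordinates only."""
    ax, ay = cgr_last(seq_a)
    bx, by = cgr_last(seq_b)
    distance = max(abs(ax - bx), abs(ay - by))
    if distance == 0:
        return math.inf  # identical coordinates: no difference to measure
    return -math.log2(distance)


if __name__ == "__main__":
    # The two sequences share the trailing segment "ACGT".
    print(shared_suffix_estimate("GGGGACGT", "TTTCACGT"))
```

    No alignment matrix is built: each pair of units is compared independently, which is what makes this kind of comparison a natural fit for a map step.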

    RObust Sparse ensemble for outlIEr detection and gene selection in cancer omics data

    We thank Peter Segaert for providing his adapted code of the enetLTS method. The results presented here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. Funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2075 390740016 (AJ and NR). Publisher Copyright: © The Author(s) 2022.
    The extraction of novel information from omics data is a challenging task, in particular since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach, which combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using outlier rankings of all three methods, and important features are selected as features commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of (Formula presented.) genes and more than (Formula presented.) samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the commonly selected genes by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.

    AliClu - Temporal sequence alignment for clustering longitudinal clinical data

    The authors acknowledge funding from the Portuguese Foundation for Science and Technology (Fundação para a Ciência e a Tecnologia - FCT) under contracts INESC-ID (UID/CEC/50021/2019) and IT (UID/EEA/50008/2019), projects PREDICT (PTDC/CCI-CIF/29877/2017), PERSEIDS (PTDC/EMS-SIS/0642/2014) and NEUROCLINOMICS2 (PTDC/EEI-SII/1937/2014). The funders had no role in the design of the study, collection, analysis and interpretation of data, or writing the manuscript.
    BACKGROUND: Patient stratification is a critical task in clinical decision making since it can allow physicians to choose treatments in a personalized way. Given the increasing availability of electronic medical records (EMRs) with longitudinal data, one crucial problem is how to efficiently cluster the patients based on the temporal information from medical appointments. In this work, we propose applying the Temporal Needleman-Wunsch (TNW) algorithm to align discrete sequences with the transition time information between symbols. These symbols may correspond to a patient's current therapy, their overall health status, or any other discrete state. The transition time information represents the duration of each of those states. The obtained TNW pairwise scores are then used to perform hierarchical clustering. To find the best number of clusters and assess their stability, a resampling technique is applied.
    RESULTS: We propose AliClu, a novel tool for clustering temporal clinical data based on the TNW algorithm coupled with clustering validity assessments through bootstrapping. AliClu was applied to the analysis of the rheumatoid arthritis EMRs obtained from the Portuguese database of rheumatologic patient visits (Reuma.pt). In particular, AliClu was used for the analysis of therapy switches, which were coded as letters corresponding to biologic drugs and included their durations before each change occurred. The obtained optimized clusters allow one to stratify the patients based on their temporal therapy profiles and to support the identification of common features for those groups.
    CONCLUSIONS: AliClu is a promising computational strategy to analyse longitudinal patient data by providing validated clusters and by unravelling the patterns that exist in clinical outcomes. Patient stratification is performed in an automatic or semi-automatic way, allowing one to tune the alignment, clustering, and validation parameters. AliClu is freely available at https://github.com/sysbiomed/AliClu.
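
    For orientation, the pairwise scoring step can be sketched with the classic Needleman-Wunsch global alignment score on letter-coded sequences. This is deliberately not TNW: the abstract does not specify how transition times enter the score, so the temporal term is omitted, and the scoring parameters and example sequences are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Classic global alignment score (no temporal term, unlike TNW)."""
    # prev[j] holds the best score for aligning the processed prefix of `a`
    # with the first j symbols of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical therapy sequences, one letter per drug.
    patients = {"p1": "ABC", "p2": "ABD", "p3": "EEC"}
    names = sorted(patients)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = needleman_wunsch_score(patients[first], patients[second])
            print(first, second, score)
```

    A matrix of such pairwise scores, converted to distances, is the kind of input a hierarchical clustering routine expects.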

    Multi-level dynamic modeling in biological systems : application of hybrid Petri nets to network simulation

    Recent progress in high-throughput experimental technologies allows the reconstruction of many biological networks and the evaluation of changes in protein, gene and metabolite levels under different conditions. On the other hand, computational models, when complemented with regulatory information, can be used to predict the phenotype of an organism under different genetic and environmental conditions. These computational methods can be used, for example, to identify molecular targets capable of inactivating a bacterium and to understand its virulence factors. This work proposes a hybrid metabolic-regulatory Petri net approach that is based on the combination of approximate enzyme-kinetic rate laws and Petri nets. A prototypic network model is used as a test case to illustrate the application of these concepts in Systems Biology.
    This work was partially supported by a post-doctoral grant from Fundação para a Ciência e a Tecnologia (FCT) (SFRH/BPD/80784/2011), by project PneumoSyS - A Systems Biology approach to the role of pneumococcal carbon metabolism in colonization and invasive disease (FCT contract: PTDC/SAU-MII/100964/2008), and by FCT (INESC-ID multiannual funding) through the PIDDAC program funds.